DocLens : A Tool-Augmented Multi-Agent Framework for Long Visual Document Understanding
Zhu, Dawei, Meng, Rui, Chen, Jiefeng, Li, Sujian, Pfister, Tomas, Yoon, Jinsung
Comprehending long visual documents, where information is distributed across extensive pages of text and visual elements, is a critical but challenging task for modern Vision-Language Models (VLMs). Existing approaches falter on a fundamental challenge: evidence localization. They struggle to retrieve relevant pages and overlook fine-grained details within visual elements, leading to limited performance and model hallucination. To address this, we propose DocLens, a tool-augmented multi-agent framework that effectively "zooms in" on evidence like a lens. It first navigates from the full document to specific visual elements on relevant pages, then employs a sampling-adjudication mechanism to generate a single, reliable answer. Paired with Gemini-2.5-Pro, DocLens achieves state-of-the-art performance on MMLongBench-Doc and FinRAGBench-V, surpassing even human experts. The framework's superiority is particularly evident on vision-centric and unanswerable queries, demonstrating the power of its enhanced localization capabilities.
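The abstract does not spell out how the sampling-adjudication mechanism works; a minimal sketch, assuming the adjudicator reduces several sampled candidate answers to one by majority vote (an assumption, not the paper's stated method), might look like:

```python
from collections import Counter

def adjudicate(candidates):
    """Majority-vote adjudication: return the most frequent candidate answer.

    The paper's actual adjudication procedure is not specified in the
    abstract; a simple majority vote is one plausible stand-in.
    """
    counts = Counter(candidates)
    answer, _ = counts.most_common(1)[0]
    return answer

# Candidates sampled from a VLM for the same query (simulated here).
candidates = ["Q3 revenue rose 12%", "Q3 revenue rose 12%", "Q3 revenue fell"]
final = adjudicate(candidates)
```

In practice the adjudicator could also weigh candidates by the strength of their localized evidence rather than by raw frequency.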
OffTopicEval: When Large Language Models Enter the Wrong Chat, Almost Always!
Lei, Jingdi, Gumma, Varun, Bhardwaj, Rishabh, Lim, Seok Min, Li, Chuan, Zadeh, Amir, Poria, Soujanya
Large Language Model (LLM) safety is one of the most pressing challenges for enabling wide-scale deployment. While most studies and global discussions focus on generic harms, such as models assisting users in harming themselves or others, enterprises face a more fundamental concern: whether LLM-based agents are safe for their intended use case. To address this, we introduce operational safety, defined as an LLM's ability to appropriately accept or refuse user queries when tasked with a specific purpose. We further propose OffTopicEval, an evaluation suite and benchmark for measuring operational safety both in general and within specific agentic use cases. Our evaluations on six model families comprising 20 open-weight LLMs reveal that while performance varies across models, all of them remain highly operationally unsafe. Even the strongest models - Qwen-3 (235B) with 77.77% and Mistral (24B) with 79.96% - fall far short of reliable operational safety, while GPT models plateau in the 62-73% range, Phi achieves only mid-level scores (48-70%), and Gemma and Llama-3 collapse to 39.53% and 23.84%, respectively. While operational safety is a core model alignment issue, to suppress these failures, we propose prompt-based steering methods: query grounding (Q-ground) and system-prompt grounding (P-ground), which substantially improve OOD refusal. Q-ground provides consistent gains of up to 23%, while P-ground delivers even larger boosts, raising Llama-3.3 (70B) by 41% and Qwen-3 (30B) by 27%. These results highlight both the urgent need for operational safety interventions and the promise of prompt-based steering as a first step toward more reliable LLM-based agents.
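The exact Q-ground prompt template is not given in the abstract; the sketch below only illustrates the idea of restating the agent's operational purpose next to the user query so the model can judge topical relevance (the wording is a hypothetical placeholder, not the paper's prompt):

```python
def q_ground(user_query: str, agent_purpose: str) -> str:
    """Hypothetical query-grounding (Q-ground) wrapper: pairs the query
    with an explicit statement of the agent's purpose so the model can
    refuse out-of-domain requests."""
    return (
        f"Your sole purpose is: {agent_purpose}.\n"
        f"If the query below falls outside that purpose, refuse to answer.\n"
        f"Query: {user_query}"
    )

prompt = q_ground("Write me a poem about the sea.",
                  "answering airline baggage-policy questions")
```

P-ground would apply the same grounding idea at the system-prompt level instead of rewriting each query.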
Red Teaming for Generative AI, Report on a Copyright-Focused Exercise Completed in an Academic Medical Center
Wen, James, Nalawade, Sahil, Liang, Zhiwei, Bielick, Catherine, Boston, Marisa Ferrara, Chowdhury, Alexander, Collin, Adele, De Angelis, Luigi, Ellen, Jacob, Frase, Heather, Gameiro, Rodrigo R., Gutierrez, Juan Manuel, Kadam, Pooja, Keceli, Murat, Krishnamurthy, Srikanth, Kwok, Anne, Lu, Yanan Lance, Mattie, Heather, McCoy, Liam G., Miller, Katherine, Morgan, Allison C., Moerig, Marlene Louisa, Nguyen, Trang, Owen-Post, Alexander, Ruiz, Alex D., Puchala, Sreekar Reddy, Samineni, Soujanya, Tohyama, Takeshi, Ullanat, Varun, Valenza, Carmine, Velez, Camilo, Wang, Pengcheng, Wuest, Anna, Zhou, Yuxiang, Zhu, Yingde, Johnson, Jason M., Lenane, Naomi, Willcox, Jennifer, Vitiello, Francis J., Celi, Leo Anthony G., Umeton, Renato
Background: Generative artificial intelligence (AI) deployment in academic medical settings raises copyright compliance concerns. Dana-Farber Cancer Institute implemented GPT4DFCI, an internal generative AI tool utilizing OpenAI models, that is approved for enterprise use in research and operations. Given (1) the exceptionally broad adoption of the tool in our organization, (2) our research mission, and (3) the shared responsibility model required to benefit from Customer Copyright Commitment in Azure OpenAI Service products, we deemed rigorous copyright compliance testing necessary. Case Description: We conducted a structured red teaming exercise in Nov. 2024, with 42 participants from academic, industry, and government institutions. Four teams attempted to extract copyrighted content from GPT4DFCI across four domains: literary works, news articles, scientific publications, and access-restricted clinical notes. Teams successfully extracted verbatim book dedications and near-exact passages through various strategies. News article extraction failed despite jailbreak attempts. Scientific article reproduction yielded only high-level summaries. Clinical note testing revealed appropriate privacy safeguards. Discussion: The successful extraction of literary content indicates potential copyrighted material presence in training data, necessitating inference-time filtering. Differential success rates across content types suggest varying protective mechanisms. The event led to implementation of a copyright-specific meta-prompt in GPT4DFCI; this mitigation has been in production since Jan. 2025. Conclusion: Systematic red teaming revealed specific vulnerabilities in generative AI copyright compliance, leading to concrete mitigation strategies. Academic medical institutions deploying generative AI should implement continuous testing protocols to ensure legal and ethical compliance.
KodCode: A Diverse, Challenging, and Verifiable Synthetic Dataset for Coding
Xu, Zhangchen, Liu, Yang, Yin, Yueqin, Zhou, Mingyuan, Poovendran, Radha
We introduce KodCode, a synthetic dataset that addresses the persistent challenge of acquiring high-quality, verifiable training data, across diverse difficulties and domains, for coding-focused Large Language Models. Existing code-focused resources typically fail to ensure either breadth of coverage (e.g., spanning simple coding tasks to advanced algorithmic problems) or verifiable correctness (e.g., unit tests). In contrast, KodCode comprises question-solution-test triplets that are systematically validated via a self-verification procedure. Our pipeline begins by synthesizing a broad range of coding questions, then generates solutions and test cases, with additional attempts allocated to challenging problems. Finally, for post-training data synthesis, questions are rewritten into diverse formats and responses are generated under a test-based reject-sampling procedure using a reasoning model (DeepSeek R1). This pipeline yields a large-scale, robust, and diverse coding dataset. KodCode is suitable for supervised fine-tuning, and the paired unit tests also offer great potential for RL tuning. Fine-tuning experiments on coding benchmarks (HumanEval(+), MBPP(+), BigCodeBench, and LiveCodeBench) demonstrate that KodCode-tuned models achieve state-of-the-art performance, surpassing models like Qwen2.5-Coder-32B-Instruct and DeepSeek-R1-Distill-Llama-70B.
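Test-based reject sampling, as described in the pipeline, can be sketched as follows; `generate` and `unit_test` are toy stand-ins for the reasoning model and KodCode's verification harness (the real system's interfaces are assumptions here):

```python
import random

def reject_sample(question, generate, unit_test, max_attempts=10, seed=0):
    """Keep only candidate solutions that pass the paired unit test;
    retry up to max_attempts times, then give up."""
    rng = random.Random(seed)
    for _ in range(max_attempts):
        candidate = generate(question, rng)
        try:
            if unit_test(candidate):
                return candidate
        except Exception:
            continue  # a crashing solution also fails verification
    return None

# Toy stand-ins: the generator proposes add/sub implementations at random;
# the paired test only accepts addition.
def toy_generate(question, rng):
    return (lambda a, b: a + b) if rng.random() < 0.5 else (lambda a, b: a - b)

def toy_test(fn):
    return fn(2, 3) == 5

solution = reject_sample("implement add(a, b)", toy_generate, toy_test)
```

Allocating more attempts (a larger `max_attempts`) to harder questions mirrors the pipeline's extra sampling budget for challenging problems.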
Step-On-Feet Tuning: Scaling Self-Alignment of LLMs via Bootstrapping
Wang, Haoyu, Ma, Guozheng, Meng, Ziqiao, Qin, Zeyu, Shen, Li, Zhang, Zhong, Wu, Bingzhe, Liu, Liu, Bian, Yatao, Xu, Tingyang, Wang, Xueqian, Zhao, Peilin
Self-alignment is an effective way to reduce the cost of human annotation while ensuring promising model capability. However, most current methods complete the data collection and training steps in a single round, which may overlook the continuously improving ability of self-aligned models. This raises a key question: what if we bootstrap self-alignment over multiple rounds? Does this strategy enhance model performance or lead to rapid degradation? In this paper, we explore the impact of bootstrapping self-alignment on large language models. Our findings reveal that bootstrapping self-alignment markedly surpasses the single-round approach by guaranteeing data diversity through in-context learning. To further exploit the capabilities of bootstrapping, we investigate and adjust the training order of the data, which yields improved model performance. Drawing on these findings, we propose Step-On-Feet Tuning (SOFT), which leverages the model's continuously enhanced few-shot ability to boost zero- and one-shot performance. Building on an easy-to-hard training recipe, we propose SOFT+, which further boosts self-alignment performance. Our experiments demonstrate the efficiency of SOFT (SOFT+) across various classification and generation tasks, highlighting the potential of bootstrapping self-alignment for continually enhancing model alignment performance.
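The multi-round loop described above can be sketched schematically; all functions and the scalar "ability" score below are toy stand-ins for illustration, not the paper's implementation:

```python
def bootstrap_self_alignment(ability, generate_data, train, rounds=3):
    """Multi-round bootstrapping sketch: each round, the current model
    generates fresh alignment data via few-shot in-context learning, the
    data is ordered easy-to-hard (the SOFT+ recipe), and the model is
    retrained on its own outputs."""
    for _ in range(rounds):
        data = generate_data(ability)               # self-generated examples
        data.sort(key=lambda ex: ex["difficulty"])  # easy-to-hard ordering
        ability = train(ability, data)              # retrain on own outputs
    return ability

# Toy dynamics: a stronger model yields more data, which improves training.
def toy_generate(ability):
    n = int(10 * ability)
    return [{"difficulty": i % 3, "text": f"ex{i}"} for i in range(n)]

def toy_train(ability, data):
    return min(1.0, ability + 0.02 * len(data))

final_ability = bootstrap_self_alignment(0.5, toy_generate, toy_train)
```

The single-round baseline corresponds to `rounds=1`; the paper's question is whether iterating this loop improves or degrades the model.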
Google now lets users ask for images of minors to be removed from Search
Google has activated a safety feature that lets minors under 18 request that images of themselves be removed from search results, The Verge has reported. Google first announced the option back in August as part of a slate of new safety measures for kids, but it's now rolling out widely to users. Google said it will remove any images of minors "with the exception of cases of compelling public interest or newsworthiness." The requests can be made by minors, their parents, guardians or other legal representatives. To do so, you'll need to supply the URLs you want removed, the name and age of the minor and the name of the person acting on their behalf.
Design Framework for Chatbots – Chatbots Magazine
When I started designing chatbots for BEEVA almost a year ago, I applied some of my UX knowledge and did some unsuccessful research looking for tools that could fit my needs. Actually, I was quite amazed that I couldn't find practical literature about the topic. There are tons of chatbots out there, but there's little about how companies actually get hands-on. I already shared some of my findings here, and here, with tools I found, general knowledge about designing chatbots and UX design applied to chatbots, but I think it would be great to give a deeper explanation of how exactly I face the situation on a regular basis. While many people immediately start thinking about how to manage the user flow, I separate my process into 4 different steps: the bot scope, the chatbot personality, a prioritized list of must-have features and the chatbot flow.
Google Assistant makes memes when users ask for them
Have you ever had a great idea for a meme, but didn't feel like going to the effort of making it yourself? If the answer is yes, then a new Google Assistant feature may help you out. Called 'Meme Buddy,' the feature lets you make memes using your voice. By using the Google Assistant on your phone, you can ask Meme Buddy to find a photo and give it a caption - or ask it to make a completely random meme for you. To use Meme Buddy, open the Google Assistant on your phone and say 'Talk to Meme Buddy.'
Amazon launches Spotify rival in the UK
Amazon is launching its Spotify rival, Amazon Music Unlimited, in the UK with a headline price of £9.99 a month and access to a library of "over 40 million" songs. Those figures are comparable to competitors like Spotify and Apple Music, but where Amazon is focusing its competition is on a pair of special offers, as well as the tight integration with its own hardware ecosystem. The service is cheaper for those with an Amazon Prime subscription, the company's all-you-can-eat free delivery package. The monthly cost drops by £2, to £7.99, and an annual payment plan is available for £79 a year, taking the equivalent monthly fee to £6.58. That's how Amazon is hoping to entice users to upgrade from its other free streaming service, Amazon Prime Music.